CytofIN (CyTOF integration) is an R package for homogenizing and integrating heterogeneous CyTOF data from diverse data sources.
Before CyTOF data integration, all CyTOF files need to be homogenized so that they share consistent channels. CytofIN requires that all input CyTOF files be homogenized against a user-provided standardized panel with user-defined search patterns. To normalize the CyTOF data, CytofIN uses a novel generalized anchor strategy that defines the baseline of the signal across batches in order to correct for batch effects. The user identifies one anchor from each plate (batch). A reference anchor is then generated from the mean expression of all identified anchors. Next, a user-specified transformation function is applied to fit each plate-specific anchor to the reference data distribution, and the same transformation is then applied to correct the sample data signal on that plate.
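The anchor idea can be illustrated with a minimal base-R sketch on simulated data. It assumes a simple per-channel mean shift as the transformation; the toy matrices, the helper `make_cells`, and the channel names are hypothetical and do not come from the package.

```r
# Toy data: two plates, each with one anchor and one sample
# (rows = cells, columns = channels). All values are simulated.
set.seed(1)
channels <- c("CD45", "CD19")
make_cells <- function(n, means) {
  m <- sapply(means, function(mu) rnorm(n, mean = mu, sd = 1))
  colnames(m) <- channels
  m
}
anchors <- list(
  plate1 = make_cells(500, c(5, 3)),
  plate2 = make_cells(500, c(7, 4))  # plate2 reads systematically higher
)
samples <- list(
  plate1 = make_cells(200, c(6, 2)),
  plate2 = make_cells(200, c(8, 3))
)

# Per-plate anchor means, then the reference anchor as their average.
anchor_means <- sapply(anchors, colMeans)  # channels x plates
reference <- rowMeans(anchor_means)

# Mean shift (assumed transformation): the offset that moves a plate's
# anchor onto the reference is applied unchanged to that plate's samples.
normalized <- lapply(names(samples), function(p) {
  shift <- reference - anchor_means[, p]
  sweep(samples[[p]], 2, shift, `+`)
})
names(normalized) <- names(samples)

# Plate-to-plate gap in sample means before and after correction.
colMeans(samples$plate2) - colMeans(samples$plate1)
colMeans(normalized$plate2) - colMeans(normalized$plate1)
```

In this simulation the two plates differ only by the offset seen in their anchors, so the gap between the corrected samples shrinks toward zero.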
CytofIN provides the following functions for CyTOF data integration: homogenize, anprep, annorm, and annorm_nrs.
library(knitr)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
lines <- options$output.lines
if (is.null(lines)) {
return(hook_output(x, options)) # pass to default hook
}
x <- unlist(strsplit(x, "\n"))
more <- "..."
if (length(lines)==1) { # first n lines
if (length(x) > lines) {
# truncate the output, but add ....
x <- c(head(x, lines), more)
}
} else {
x <- c(more, x[lines], more)
}
# paste these lines together
x <- paste(c(x, ""), collapse = "\n")
hook_output(x, options)
})
library(devtools)
install_github('bennyyclo/Cytofin')
#> Skipping install of 'cytofin' from a github remote, the SHA1 (a9cf729d) has not changed since last install.
#> Use `force = TRUE` to force installation
CyTOF data homogenization
Description: the homogenize function takes a user-supplied antigen panel table, which lists each standardized antigen name together with its associated antigen search pattern. Given CyTOF files with distinct antigen naming, the function performs a regular expression search to match each synonymous term against the panel and replaces the antigen name with the standardized name from the panel.
Function Definition:
homogenize(metadata_filename, panel_filename, input_file_dir, output_file_dir)
Input:
metadata_filename: metadata table of raw CyTOF files (.fcs)(must be in the current directory).
panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).
input_file_dir: folder directory containing input raw CyTOF files.
output_file_dir: folder directory containing output homogenized files.
Output: homogenized CyTOF file with user-defined channels presented in the standardized antigen table.
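The pattern-matching step can be sketched in base R as follows. The column names `range` and `antigen_pattern` and the patterns themselves mirror the demo panel printed later in this README; the file antigen names and the helper `standardize` are hypothetical, and how homogenize performs the match internally is an assumption.

```r
# A few panel rows: standardized name plus a regex search pattern.
panel <- data.frame(
  range = c("Time", "CD33", "Kappa_lambda", "TdT"),
  antigen_pattern = c("[Tt]ime", "FITC|CD33", "appa", "Td"),
  stringsAsFactors = FALSE
)

# Antigen names as they might appear in differently labeled files.
file_antigens <- c("time", "FITC_myeloid", "kappa/lambda", "TdT", "CD999")

# For each file antigen, return the standardized name of the first
# matching pattern, or NA when nothing in the panel matches.
standardize <- function(antigens, panel) {
  vapply(antigens, function(a) {
    hit <- which(vapply(panel$antigen_pattern, grepl, logical(1), x = a))
    if (length(hit) == 0) NA_character_ else panel$range[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

standardized <- standardize(file_antigens, panel)
standardized
```

Short patterns such as `Td` can match more names than intended, so the search patterns in a real panel are worth checking against every file's antigen list.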
CyTOF data normalization using external anchors
External anchor normalization consists of two steps: (1) preparation of the external anchors and (2) application of a transformation function.
Description: the anprep function concatenates the identified anchor files (one file per plate/batch) and then generates summary statistics, including mean and variance, that are used for batch correction.
Function definition:
anprep(metadata_filename, panel_filename, input_file_dir)
Input:
metadata_filename: metadata table of anchor CyTOF files (.fcs)(must be in the current directory).
panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).
input_file_dir: folder directory containing output data.
Output: an RData object containing reference statistics of the universal reference and concatenated anchor FCS files.
The RData object stores the following variables describing the universal reference. These variables are loaded from the RData object for subsequent batch normalization:
mean_uni: a 1-dimensional array of mean expression for all input markers
mean_var: a 1-dimensional array of mean variance values of all marker expressions.
mean_uni_mean: the mean value of mean_uni array.
mean_uni_var: the mean value of mean_var array.
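As a rough illustration of what these four variables hold, the sketch below computes them from simulated anchor matrices and round-trips them through an RData file. Averaging per-plate statistics is an assumption made here for illustration; anprep may instead compute them on the concatenated anchors. The toy data and the temporary file path are hypothetical.

```r
set.seed(2)
markers <- c("CD45", "CD19", "CD34")

# Toy anchors: one matrix per plate (rows = cells, columns = markers).
anchors <- list(
  plate1 = matrix(rnorm(300 * 3, mean = 5, sd = 1), ncol = 3,
                  dimnames = list(NULL, markers)),
  plate2 = matrix(rnorm(300 * 3, mean = 6, sd = 2), ncol = 3,
                  dimnames = list(NULL, markers))
)

# Per-plate, per-marker mean and variance (markers x plates).
per_plate_mean <- sapply(anchors, colMeans)
per_plate_var <- sapply(anchors, function(m) apply(m, 2, var))

# Assumed definitions of the reference statistics.
mean_uni <- rowMeans(per_plate_mean)  # one mean per marker
mean_var <- rowMeans(per_plate_var)   # one mean variance per marker
mean_uni_mean <- mean(mean_uni)       # scalar: mean of mean_uni
mean_uni_var <- mean(mean_var)        # scalar: mean of mean_var

# Store the variables in an RData file and load them back.
rdata_path <- tempfile(fileext = ".RData")
save(mean_uni, mean_var, mean_uni_mean, mean_uni_var, file = rdata_path)
restored <- new.env()
load(rdata_path, envir = restored)
ls(restored)
```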
Description: the annorm function applies one of several transformation functions (modes) to normalize each anchor to the reference statistics generated by the anprep function.
Function definition:
annorm(control_metadata_filename, control_data_filename, sample_metadata_filename, panel_filename, input_file_dir, val_file_dir = "none", output_file_dir, mode)
Input:
control_metadata_filename: metadata table of anchor CyTOF files (.fcs)(must be in the current directory).
control_data_filename: RData object containing anchor reference statistics (must be in the current directory).
sample_metadata_filename: metadata table of homogenized CyTOF files (.fcs)(must be in the current directory).
panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).
input_file_dir: folder directory containing input homogenized CyTOF data file.
val_file_dir: folder directory containing validation homogenized CyTOF data file (optional).
output_file_dir: folder directory containing output normalized CyTOF data file.
mode: transformation function used for normalization (meanshift, meanshift_bulk, variance, z_score, beadlike).
Output: normalized CyTOF files.
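To make the difference between transformation modes concrete, here is a hedged sketch contrasting a location-only correction with a location-and-scale correction on one simulated channel. The list names reuse two of the mode names above, but the formulas are generic textbook forms assumed for illustration, not the package's implementation; `ref_mean` and `ref_sd` are hypothetical reference statistics.

```r
# Illustrative transformations only; the formulas are assumptions.
transforms <- list(
  # shift so that the anchor mean lands on the reference mean
  meanshift = function(x, anchor, ref_mean, ref_sd) {
    x - mean(anchor) + ref_mean
  },
  # standardize against the anchor, then rescale to the reference
  z_score = function(x, anchor, ref_mean, ref_sd) {
    (x - mean(anchor)) / sd(anchor) * ref_sd + ref_mean
  }
)

set.seed(4)
anchor <- rnorm(400, mean = 7, sd = 2)         # one channel of a plate anchor
sample_signal <- rnorm(200, mean = 8, sd = 2)  # same channel of a sample
ref_mean <- 5
ref_sd <- 1

# Fit the anchor to the reference, and apply the same map to the sample.
fitted_anchor <- lapply(transforms, function(f) f(anchor, anchor, ref_mean, ref_sd))
corrected <- lapply(transforms, function(f) f(sample_signal, anchor, ref_mean, ref_sd))

sapply(fitted_anchor, function(v) c(mean = mean(v), sd = sd(v)))
```

Both transformations place the anchor mean on the reference mean; only the second also matches the spread, which matters when batches differ in signal variance as well as in offset.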
CyTOF data normalization using internal anchors
Description: When external references are not available, internal anchors can be used instead. Here, the most stable channels are identified as internal anchors using a PCA-based non-redundancy score (NRS). A minimum of three channels should be selected to establish an internal reference against which the signal can be calibrated between CyTOF files.
Function definition:
annorm_nrs(sample_metadata_filename, panel_filename, input_file_dir, val_file_dir="none", output_file_dir, nchannels)
Input:
sample_metadata_filename: metadata table of homogenized CyTOF files (.fcs)(must be in the current directory).
panel_filename: standardized antigen panel table file (.xlsx/.csv)(must be in the current directory).
input_file_dir: folder directory containing input homogenized CyTOF data file.
val_file_dir: folder directory containing validation homogenized CyTOF data file (optional).
output_file_dir: folder directory containing output normalized CyTOF data file.
nchannels: number of stabilized channels used for normalization.
Output: normalized CyTOF files.
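A PCA-based non-redundancy score can be sketched with `prcomp` from base R. The scoring below (absolute loadings on the leading components, weighted by component variance) follows the commonly used NRS definition; treating the channels with the lowest average score as the stable ones is an assumption about annorm_nrs, and the simulated files and the helper `nrs` are hypothetical.

```r
# NRS per channel: absolute loadings on the first `ncomp` principal
# components, weighted by the variance of each component.
nrs <- function(x, ncomp = 3) {
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  k <- seq_len(min(ncomp, ncol(x)))
  drop(abs(pca$rotation[, k, drop = FALSE]) %*% pca$sdev[k]^2)
}

# Two simulated files: channels A-C vary little, D and E vary a lot.
set.seed(3)
channels <- c("A", "B", "C", "D", "E")
make_file <- function(n) {
  m <- sapply(c(0.1, 0.1, 0.1, 2, 3), function(s) rnorm(n, sd = s))
  colnames(m) <- channels
  m
}
files <- list(file1 = make_file(1000), file2 = make_file(1000))

# Score every file, average across files, keep the lowest-scoring channels.
scores <- sapply(files, nrs)  # channels x files
mean_scores <- rowMeans(scores)
nchannels <- 3
stable <- names(sort(mean_scores))[seq_len(nchannels)]
stable
```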
Computational pipeline for CyTOF data integration
Below is a demo R script that uses the CytofIN package for CyTOF data integration.
First, let's homogenize the panel of all FCS files using the homogenize function:
#import cytofin R package
library(cytofin)
#homogenize the antigen panel using the demo data supplied with the package
metadata_filename <- paste0(path.package("cytofin"),"/extdata/test_metadata_raw.csv")
panel_filename <- paste0(path.package("cytofin"),"/extdata/test_panel.csv")
input_file_dir_homogenize <- paste0(path.package("cytofin"),"/extdata/test_raw_fcs_files/")
output_file_dir_homogenize <- "out_test/"
homogenize(metadata_filename, panel_filename, input_file_dir_homogenize, output_file_dir_homogenize)
#> Warning in dir.create(output_file_dir): 'out_test' already exists
#> filename cohort plate_number patient_id condition
#> 1 ALL05v2_Plate2_UPN94 das.fcs ALL05v2 plate2 UPN94 Das
#> 2 ALL08_Plate8_UPN26 basal.fcs ALL08 plate8 UPN26 Basal
#> 3 CRLF2_Plate1_UPN53 das + TSLP.fcs CRLF2 plate1 UPN53 das_TSLP
#> 4 ALL05v2_Plate2_healthy basal1.fcs ALL05v2 plate2 Healthy Basal
#> 5 ALL08_Plate8_Healthy03 basal.fcs ALL08 plate8 Healthy03 Basal
#> 6 CRLF2_Plate1_Healthy 04 BCR.fcs CRLF2 plate1 Healthy04 BCR
#> 7 MS_Plate5_SU978 Basal.fcs MajSak plate5 SU978 Basal
#> 8 MS_Plate5_Healthy BM.fcs MajSak plate5 Healthy BM
#> 9 SJ_Plate2_TB010950_Basal.fcs StJude plate2 TB010950 Basal
#> 10 SJ_Plate2_Healthy_BM.fcs StJude plate2 Healthy BM
#> population validation
#> 1 <NA> homogenized_ALL05v2_plate2_UPN94 das.fcs
#> 2 <NA> homogenized_ALL08_plate8_UPN26 basal.fcs
#> 3 <NA> homogenized_CRLF2_plate1_UPN53 das + TSLP.fcs
#> 4 1 homogenized_ALL05v2_plate2_healthy basal1.fcs
#> 5 <NA> homogenized_ALL08_plate9_Healthy03 basal.fcs
#> 6 <NA> homogenized_CRLF2_plate1_Healthy 04 BCR.fcs
#> 7 <NA> homogenized_MajSak_plate5_SU978 Basal.fcs
#> 8 <NA> homogenized_MajSak_plate5_Healthy BM.fcs
#> 9 <NA> homogenized_StJude_plate2_TB010950_Basal.fcs
#> 10 <NA> homogenized_StJude_plate2_Healthy_BM.fcs
#> desc range metal_pattern antigen_pattern Lineage Functional
#> 1 Time Time [Tt]ime [Tt]ime 0 0
#> 2 Event_length Event_length ength ength 0 0
#> 3 (Pd102)Di BC1 Pd102 BC1 0 0
#> 4 (Pd104)Di BC2 Pd104 BC2 0 0
#> 5 (Pd105)Di BC3 Pd105 BC3 0 0
#> 6 (Pd106)Di BC4 Pd106 BC4 0 0
#> 7 (Pd108)Di BC5 Pd108 BC5 0 0
#> 8 (Pd110)Di BC6 Pd110 BC6 0 0
#> 9 (In113)Di CD235_CD61 In113 CD235 1 0
#> 10 (In115)Di CD45 In115 CD45 1 0
#> 11 (La139)Di cPARP La139 PARP 0 1
#> 12 (Pr141)Di pPLCg1_2 Pr141 pPLCg1_2 0 1
#> 13 (Nd142)Di CD19 Nd142 CD19 1 0
#> 14 (Nd143)Di CD22 Nd143 CD22 1 0
#> 15 (Nd144)Di p4EBP1 Nd144 p4EBP1 0 1
#> 16 (Nd145)Di tIkaros Nd145 tIkaros 1 0
#> 17 (Nd146)Di CD79b Nd146 CD79b 1 0
#> 18 (Sm147)Di CD20 [PS]m147 CD20 1 0
#> 19 (Nd148)Di CD34 Nd148 CD34 1 0
#> 20 (Sm149)Di CD179a Sm149 CD179a 1 0
#> 21 (Nd150)Di pSTAT5 Nd150 pSTAT5 0 1
#> 22 (Sm152)Di Ki67 Sm152 Ki67 0 1
#> 23 (Eu153)Di IgMi Eu153 IgMi 1 0
#> 24 (Sm154)Di Kappa_lambda Sm154 appa 0 1
#> 25 (Gd156)Di CD10 Gd156 CD10 1 0
#> 26 (Gd158)Di CD179b Gd158 CD179b 1 0
#> 27 (Gd160)Di CD24 Gd160 CD24 1 0
#> 28 (Dy161)Di TSLPr Dy161 TSLPr 0 1
#> 29 (Dy162)Di CD127 Dy162 CD127 1 0
#> 30 (Dy163)Di RAG1 Dy163 RAG1 1 0
#> 31 (Dy164)Di TdT Dy164 Td 1 0
#> 32 (Ho165)Di Pax5 Ho165 Pax5 1 0
#> 33 (Er166)Di pSyk Er166 pSyk 0 1
#> 34 (Er167)Di CD43 Er167 CD43 1 0
#> 35 (Er168)Di CD38 Er168 CD38 1 0
#> 36 (Er170)Di CD3 Er170 CD3^ 1 0
#> 37 (Yb171)Di CD33 Yb171 FITC|CD33 0 1
#> 38 (Yb172)Di pS6 Yb172 pS6 0 1
#> 39 (Yb173)Di pErk Yb173 pErk 0 1
#> 40 (Yb174)Di HLADR Yb174 HLADR 1 0
#> 41 (Lu175)Di IgMs Lu175 IgMs 1 0
#> 42 (Yb176)Di pCreb [YbLu]176 pCreb 0 1
#> 43 (Ir191)Di DNA1 Ir191 DNA1 0 1
#> 44 (Ir193)Di DNA2 Ir193 DNA2 0 1
#> General
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 0
#> 10 0
#> 11 0
#> 12 0
#> 13 0
#> 14 0
#> 15 0
#> 16 0
#> 17 0
#> 18 0
#> 19 0
#> 20 0
#> 21 0
#> 22 0
#> 23 0
#> 24 0
#> 25 0
#> 26 0
#> 27 0
#> 28 0
#> 29 0
#> 30 0
#> 31 0
#> 32 0
#> 33 0
#> 34 0
#> 35 0
#> 36 0
#> 37 0
#> 38 0
#> 39 0
#> 40 0
#> 41 0
#> 42 0
#> 43 0
#> 44 0
#> uneven number of tokens: 529
#> The last keyword is dropped.
#> uneven number of tokens: 529
#> The last keyword is dropped.
#> filename: ALL05v2_Plate2_UPN94 das.fcs
#> 1
#> matched data_antigen: Time ref_antigen: Time ref_antigen_pattern [Tt]ime
#> 2
#> matched data_antigen: Event_length ref_antigen: Event_length ref_antigen_pattern ength
#> 3
#> matched data_antigen: BC1 ref_antigen: BC1 ref_antigen_pattern BC1
#> 4
#> matched data_antigen: BC2 ref_antigen: BC2 ref_antigen_pattern BC2
#> 5
#> matched data_antigen: BC3 ref_antigen: BC3 ref_antigen_pattern BC3
#> 6
#> matched data_antigen: BC4 ref_antigen: BC4 ref_antigen_pattern BC4
#> 7
#> matched data_antigen: BC5 ref_antigen: BC5 ref_antigen_pattern BC5
#> 8
#> matched data_antigen: BC6 ref_antigen: BC6 ref_antigen_pattern BC6
#> 9
#> matched data_antigen: CD235_CD61 ref_antigen: CD235_CD61 ref_antigen_pattern CD235
#> 10
#> matched data_antigen: CD45 ref_antigen: CD45 ref_antigen_pattern CD45
#> 11
#> matched data_antigen: cPARP ref_antigen: cPARP ref_antigen_pattern PARP
#> 12
#> matched data_antigen: pPLCg1_2 ref_antigen: pPLCg1_2 ref_antigen_pattern pPLCg1_2
#> 13
#> matched data_antigen: CD19 ref_antigen: CD19 ref_antigen_pattern CD19
#> 14
#> matched data_antigen: CD22 ref_antigen: CD22 ref_antigen_pattern CD22
#> 15
#> matched data_antigen: p4EBP1 ref_antigen: p4EBP1 ref_antigen_pattern p4EBP1
#> 16
#> matched data_antigen: tIkaros ref_antigen: tIkaros ref_antigen_pattern tIkaros
#> 17
#> matched data_antigen: CD79b ref_antigen: CD79b ref_antigen_pattern CD79b
#> 18
#> matched data_antigen: CD20 ref_antigen: CD20 ref_antigen_pattern CD20
#> 19
#> matched data_antigen: CD34 ref_antigen: CD34 ref_antigen_pattern CD34
#> 20
#> matched data_antigen: CD179a ref_antigen: CD179a ref_antigen_pattern CD179a
#> 21
#> matched data_antigen: pSTAT5 ref_antigen: pSTAT5 ref_antigen_pattern pSTAT5
#> 22
#> matched data_antigen: Ki67 ref_antigen: Ki67 ref_antigen_pattern Ki67
#> 23
#> matched data_antigen: IgMi ref_antigen: IgMi ref_antigen_pattern IgMi
#> 24
#> matched data_antigen: Kappa_lambda ref_antigen: Kappa_lambda ref_antigen_pattern appa
#> 25
#> matched data_antigen: CD10 ref_antigen: CD10 ref_antigen_pattern CD10
#> 26
#> matched data_antigen: CD179b ref_antigen: CD179b ref_antigen_pattern CD179b
#> 27
#> matched data_antigen: CD24 ref_antigen: CD24 ref_antigen_pattern CD24
#> 28
#> matched data_antigen: TSLPr ref_antigen: TSLPr ref_antigen_pattern TSLPr
#> 29
#> matched data_antigen: CD127 ref_antigen: CD127 ref_antigen_pattern CD127
#> 30
#> matched data_antigen: RAG1 ref_antigen: RAG1 ref_antigen_pattern RAG1
#> 31
#> matched data_antigen: TdT ref_antigen: TdT ref_antigen_pattern Td
#> 32
#> matched data_antigen: Pax5 ref_antigen: Pax5 ref_antigen_pattern Pax5
#> 33
#> matched data_antigen: pSyk ref_antigen: pSyk ref_antigen_pattern pSyk
#> 34
#> matched data_antigen: CD43 ref_antigen: CD43 ref_antigen_pattern CD43
#> 35
#> matched data_antigen: CD38 ref_antigen: CD38 ref_antigen_pattern CD38
#> 36
#> matched data_antigen: ref_antigen: CD3 ref_antigen_pattern CD3^
#> 37
#> matched data_antigen: FITC_myeloid ref_antigen: CD33 ref_antigen_pattern FITC|CD33
#> 38
#> matched data_antigen: pS6 ref_antigen: pS6 ref_antigen_pattern pS6
#> 39
#> matched data_antigen: pErk ref_antigen: pErk ref_antigen_pattern pErk
#> 40
#> matched data_antigen: HLADR ref_antigen: HLADR ref_antigen_pattern HLADR
#> 41
#> matched data_antigen: IgMs ref_antigen: IgMs ref_antigen_pattern IgMs
#> 42
...
This step homogenizes the marker names of your target samples to the names given in your test_panel.csv file.
Next, we will generate an RData object which contains statistics of the generalized anchors.
#prep external anchor
anchor_metadata_filename <- paste0(path.package("cytofin"),"/extdata/test_anchor_metadata_raw.csv")
input_file_dir_anprep <- output_file_dir_homogenize #use the homogenized files
anprep(anchor_metadata_filename, panel_filename, input_file_dir_anprep)
#> [1] "concatenated_control_untransformed.fcs"
Now we perform the batch normalization with the CytofIN normalization function, using healthy control samples as generalized anchors:
#data normalization using external anchors and meanshift transformation function
val_file_dir <- paste0(path.package("cytofin"),"/extdata/test_batch_fcs_files/")
anchor_data_filename <- "./Prep_control.RData"
output_file_dir_annorm <- "norm_test/"
mode <- "meanshift"
annorm(anchor_metadata_filename, anchor_data_filename, metadata_filename, panel_filename,
input_file_dir_anprep, val_file_dir, output_file_dir_annorm, mode)
#> Warning in dir.create(output_file_dir): 'norm_test' already exists
#> ALL05v2_Plate2_UPN94 das.fcs
#> ALL08_Plate8_UPN26 basal.fcs
#> CRLF2_Plate1_UPN53 das + TSLP.fcs
#> ALL05v2_Plate2_healthy basal1.fcs
#> ALL08_Plate8_Healthy03 basal.fcs
#> CRLF2_Plate1_Healthy 04 BCR.fcs
#> MS_Plate5_SU978 Basal.fcs
#> MS_Plate5_Healthy BM.fcs
#> SJ_Plate2_TB010950_Basal.fcs
#> SJ_Plate2_Healthy_BM.fcs
In the resulting plots, normalization has improved a sample when its normalized signal (dark green) lies closer to the validation signal (purple) than it did before normalization.
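The same comparison can be made numerically. The sketch below uses hypothetical per-channel mean signals for a single sample and a mean absolute difference as the distance; neither the numbers nor the metric come from the package.

```r
# Hypothetical per-channel mean signals for one sample.
validation <- c(CD45 = 5.0, CD19 = 3.0, CD34 = 1.5)
raw <- c(CD45 = 6.2, CD19 = 3.9, CD34 = 2.1)
normalized <- c(CD45 = 5.1, CD19 = 3.2, CD34 = 1.4)

# Mean absolute difference to the validation signal, matched by channel.
distance <- function(x, ref) mean(abs(x - ref[names(x)]))

before <- distance(raw, validation)
after <- distance(normalized, validation)
improved <- after < before
improved
```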
Now let us try to normalize with stable channels as generalized anchors:
#data normalization using 4 internal channels and meanshift_bulk transformation function
nchannels <- 4
output_file_dir_annorm_nrs <- "norm_test2/"
annorm_nrs(metadata_filename, panel_filename, input_file_dir_anprep, val_file_dir,
output_file_dir_annorm_nrs, nchannels)
#> Warning in dir.create(output_file_dir): 'norm_test2' already exists
#> Warning: `fun.y` is deprecated. Use `fun` instead.
#> ALL05v2_Plate2_UPN94 das.fcs
#> ALL08_Plate8_UPN26 basal.fcs
#> CRLF2_Plate1_UPN53 das + TSLP.fcs
#> ALL05v2_Plate2_healthy basal1.fcs
#> ALL08_Plate8_Healthy03 basal.fcs
#> CRLF2_Plate1_Healthy 04 BCR.fcs
#> MS_Plate5_SU978 Basal.fcs
#> MS_Plate5_Healthy BM.fcs
#> SJ_Plate2_TB010950_Basal.fcs
#> SJ_Plate2_Healthy_BM.fcs